# Lab 15 - More hypothesis testing

We will start with another example of hypothesis testing on qualitative data with multiple categories, and then introduce a hypothesis test for quantitative data that compares the means of two groups.

First, let's import the necessary libraries.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

###  Are the complaints in zip code 10468 different than in New York City as a whole?

Zip code 10468 contains Lehman College.  We will test whether the distribution of complaints in this zip code is the same as the distribution of complaints in all of New York City.

Null hypothesis:  The complaints in zip code 10468 have the same distribution as complaints made in New York City as a whole.

Alternative hypothesis:  The complaints in zip code 10468 have a different distribution from complaints made in New York City as a whole.

Test statistic:  The Total Variation Distance (TVD) from Lab 14.  Recall the TVD is computed between two distributions by taking the absolute difference of the probabilities for each category, summing them, and dividing by 2.  
Ex.  `np.abs(df["Distribution 1"] - df["Distribution 2"]).sum()/2`
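As a quick check of the formula, here is a minimal sketch on two made-up distributions. The dataframe `toy` and its category names are invented for illustration and are not part of the lab data.

```python
import numpy as np
import pandas as pd

# Two made-up distributions over the same three categories (each sums to 1).
toy = pd.DataFrame(
    {"Distribution 1": [0.5, 0.3, 0.2], "Distribution 2": [0.4, 0.4, 0.2]},
    index=["Noise", "Heating", "Parking"],
)

# The absolute differences are 0.1, 0.1 and 0.0.
# Their sum is 0.2, and dividing by 2 gives a TVD of about 0.1.
toy_tvd = np.abs(toy["Distribution 1"] - toy["Distribution 2"]).sum() / 2
print(toy_tvd)
```

A TVD of 0 means the two distributions are identical, and a TVD of 1 means they share no categories.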

Load your CSV file with call data from March 3 and 4, 2019 into the dataframe `calls`.  Read the `Created Date` column in as a date/time.
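If you don't remember how to read a column in as a date/time, the sketch below shows the `parse_dates` parameter of `pd.read_csv`. It reads a few made-up rows from a string instead of a file so that it runs on its own. The rows, the variable `csv_text`, and the dataframe name `toy_calls` are assumptions for illustration; with your own data, pass the path to your CSV file instead of `csv_text`.

```python
import io
import pandas as pd

# Stand-in for the real file: a few made-up rows using column names from this lab.
csv_text = io.StringIO(
    "Created Date,Complaint Type,Incident Zip\n"
    "2019-03-03 00:15:00,Noise - Residential,10468\n"
    "2019-03-03 08:30:00,HEAT/HOT WATER,10468\n"
    "2019-03-04 14:05:00,Illegal Parking,10001\n"
)

# parse_dates converts the listed columns to datetimes while reading.
toy_calls = pd.read_csv(csv_text, parse_dates=["Created Date"])

# The Created Date column should now have a datetime dtype, not object.
print(toy_calls.dtypes)
```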

Display the `calls` dataframe to make sure it was loaded into memory correctly.  If you want to see all columns, run `pd.set_option("display.max_columns",None)` first.

First, get the probabilities of each complaint type in your whole dataframe and store them in the variable `nyc_probs`.  

<details> <summary>Hint:</summary>
The function `value_counts()` computes how many of each complaint happened, and adding the parameter `normalize = True` will divide each count by the total number of complaints, giving the probability.
</details>
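To see the difference between counts and probabilities, here is a small sketch on a made-up series of four complaints. The series `complaints` is invented for illustration.

```python
import pandas as pd

# Four made-up complaints: two Noise, one Heating, one Parking.
complaints = pd.Series(["Noise", "Noise", "Heating", "Parking"])

# Without normalize, value_counts() returns how many times each value occurs.
counts = complaints.value_counts()

# With normalize=True, each count is divided by the total (4 here),
# so Noise becomes 0.5 and the other two become 0.25 each.
probs = complaints.value_counts(normalize=True)

print(counts)
print(probs)
```

The probabilities always sum to 1, which is what makes them a distribution.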

Next, create a new dataframe of only the calls from zip code 10468 (or a zip code of your choice.) 

<details> <summary>Hint:</summary>
Create a filter and then apply it.  You will have to look in the dataframe to see what the column containing the zip code is called.
</details>

<details> <summary>Answer:</summary>
lehman_filter = calls["Incident Zip"] == 10468
lehman_calls = calls[lehman_filter]
</details>
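The sketch below shows what a filter does, using a made-up dataframe `toy_calls` (not the lab data). It assumes the zip codes are stored as numbers. If your `Incident Zip` column holds strings instead, you would compare against `"10468"` in quotes.

```python
import pandas as pd

# Made-up calls; the column names follow the ones used in this lab.
toy_calls = pd.DataFrame({
    "Complaint Type": ["Noise", "Heating", "Parking", "Noise"],
    "Incident Zip": [10468, 10468, 10001, 10453],
})

# The filter is a Series of True/False values, one per row.
zip_filter = toy_calls["Incident Zip"] == 10468
print(zip_filter)

# Applying the filter keeps only the rows where it is True.
toy_lehman = toy_calls[zip_filter]
print(len(toy_lehman))
```

If your filtered dataframe comes out empty, checking `calls["Incident Zip"].dtype` is a good first step.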

Compute the probabilities of the different complaints in the 10468 zip code.

Just looking at the first few probabilities in the two distributions, do you notice any differences?

We will now perform the hypothesis test to formally check if there is a difference between the distributions.  We will create a new dataframe containing the two distributions, and then can continue as in the jury panel example in Lab 14.  To make a new dataframe called `df` from the probabilities of the NYC complaints, type `df = pd.DataFrame(nyc_probs)` below.

Check that the dataframe was created correctly.

Next let's add a column to our dataframe `df` containing the 10468 complaint probabilities.  Again display the new dataframe to check your code worked.

Some complaints showed up in the NYC calls, but not the calls from zip 10468.  How can you tell which complaints these are in the dataframe?

Complaints that were in the NYC calls but not in the 10468 calls have `NaN` for the probability in the 10468 column.  If we wanted to replace `NaN` with a number, what should the number be?

To replace the NaNs with 0's, type `df = df.fillna(0)` below and run it. Note that you need `df=` to save the changes you made.
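Here is a small sketch of why the `NaN`s appear and what `fillna(0)` does. When a Series is added as a new column, pandas matches it to the dataframe by index label, so any label missing from the Series gets `NaN`. The names `city_probs`, `zip_probs` and `toy` are made up for illustration.

```python
import pandas as pd

# Made-up probabilities: the zip code has no "Parking" complaints at all.
city_probs = pd.Series({"Noise": 0.5, "Heating": 0.3, "Parking": 0.2})
zip_probs = pd.Series({"Noise": 0.75, "Heating": 0.25})

toy = pd.DataFrame({"City": city_probs})

# The new column is aligned by index label, so "Parking" gets NaN.
toy["Zip"] = zip_probs
n_missing = toy["Zip"].isna().sum()
print(n_missing)

# A complaint that never occurred has probability 0.
toy = toy.fillna(0)
print(toy)
```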

Check that the NaNs have been replaced by 0's by displaying `df`.

We want to know if the differences between the 10468 complaint distribution and the NYC complaint distribution are just due to chance or because the distributions are different.  To test this, we need to:

1. Compute the size of the sample of 10468 complaints (to know how large our samples in step 3 should be). 
2. Compute the Total Variation Distance (TVD) between the 10468 and NYC complaint distributions.
3. Compute the Total Variation Distances between random samples drawn from all NYC complaints and the NYC complaint distribution, and make a histogram of them.
4. Compare the TVD from step 2 with the histogram from step 3, and reject or fail to reject the null hypothesis.

Do step 1: compute the size of the sample of 10468 complaints (to know how large our samples in step 3 should be)

Do step 2: compute the Total Variation Distance (TVD) between the 10468 and NYC complaint distributions.

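That is, compute the TVD between the `Complaint Type` and `10468 Complaints` columns in your dataframe `df`.  (In pandas 2.0 and later, the column created from `value_counts(normalize = True)` is named `proportion` rather than `Complaint Type`; check `df.columns` and use whichever name your dataframe has.)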

<details> <summary>Answer:</summary>
np.abs(df["Complaint Type"] - df["10468 Complaints"]).sum()/2
</details>

We will break step 3 up.  First, let's generate one sample from the NYC complaint distribution.  Instead of simulating the sample like in the previous hypothesis testing examples, simply take a random sample of size 144 (the sample size from step 1; use your own number if it is different) from the `calls` dataframe.
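As a reminder of how sampling rows works, here is a minimal sketch using the dataframe method `sample` on a made-up dataframe `toy_calls`. The `random_state=0` argument is only there so the example gives the same rows every time; leave it out in the lab so each sample is different.

```python
import pandas as pd

# Ten made-up complaints.
toy_calls = pd.DataFrame(
    {"Complaint Type": ["Noise"] * 6 + ["Heating"] * 3 + ["Parking"]}
)

# Draw 4 rows at random, without replacement (the default).
toy_sample = toy_calls.sample(4, random_state=0)
print(toy_sample)
```

The sampled rows keep their original index labels, which is a quick way to see which rows were picked.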

Compute the probabilities for the complaints in the sample, and add them to dataframe `df` as a new column.  Remember to replace the NaNs with 0's.

<details> <summary>Answer:</summary>
sample_counts = sample["Complaint Type"].value_counts(normalize = True)
df["Sample complaints"] = sample_counts
df = df.fillna(0)
df
</details>

Compute the TVD between the sample complaints and all NYC complaints.

<details> <summary>Answer:</summary>
np.abs(df["Sample complaints"] - df["Complaint Type"]).sum()/2
</details>

Now we want to repeat these steps (sample from `calls` and compute the TVD between the sample and all NYC complaints) many times.

<details> <summary>Answer:</summary>
tvds = []
for i in range(10000):
    sample = calls.sample(144)
    sample_counts = sample["Complaint Type"].value_counts(normalize = True)
    df["Sample complaints"] = sample_counts
    df = df.fillna(0)
    sample_tvd = np.abs(df["Sample complaints"] - df["Complaint Type"]).sum()/2
    tvds.append(sample_tvd)
</details>

Display the histogram of the TVDs:

Finally, compare the TVD between the 10468 sample and all NYC complaints with the histogram, and decide whether to reject the null hypothesis or not.
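Besides comparing by eye, one way to put a number on the comparison is to compute the fraction of simulated TVDs that are at least as large as the observed one (an empirical p-value). The sketch below does this on made-up data, not on the 311 calls. The population, the sample size of 50, the 1000 repetitions, the helper `tvd_from_population` and the value `observed_tvd = 0.30` are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # only so the example is repeatable

# A made-up "city" of 1000 complaints and its distribution.
population = pd.Series(["Noise"] * 500 + ["Heating"] * 300 + ["Parking"] * 200)
pop_probs = population.value_counts(normalize=True)

def tvd_from_population(sample):
    """TVD between a sample's distribution and the population's."""
    sample_probs = sample.value_counts(normalize=True)
    # Categories missing from the sample get probability 0.
    sample_probs = sample_probs.reindex(pop_probs.index, fill_value=0)
    return np.abs(sample_probs - pop_probs).sum() / 2

# TVDs of random samples: what we expect if the null hypothesis is true.
sim_tvds = np.array(
    [tvd_from_population(population.sample(50)) for i in range(1000)]
)

# Hypothetical observed TVD; in the lab this is your value from step 2.
observed_tvd = 0.30

# Fraction of simulated TVDs at least as large as the observed one.
p_value = np.mean(sim_tvds >= observed_tvd)
print(p_value)
```

A small fraction means the observed TVD would be unusual for a random sample, which is evidence against the null hypothesis.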

### Comparing mean trip distances with 1 or more than 1 passengers

Let's return to our hypothesis from Lab 12: Taxis with more than 1 passenger take longer trips on average than taxis with only 1 passenger.

We can now formally test this hypothesis:

#### Hypothesis testing step 1

Null hypothesis: The mean trip distance is the same for taxis with 1 passenger and taxis with more than 1 passenger.  
Alternative hypothesis:  Taxis with more than 1 passenger take longer trips on average than taxis with only 1 passenger.

Before proceeding further, load the green taxi trip data into the dataframe `taxi`.

#### Hypothesis testing step 2

Our test statistic will be the difference in mean trip distance between trips with only 1 passenger and trips with 2 or more passengers.  To calculate the test statistic for the data:

1. Compute the mean trip distance when there is only 1 passenger.
2. Compute the mean trip distance when there are 2 or more passengers.
3. Subtract mean 1 from mean 2 and take the absolute value.
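The three steps above can be sketched on a tiny made-up dataframe where the arithmetic is easy to check by hand. The dataframe `toy_taxi` and its values are invented; the column names `passenger_count` and `trip_distance` are the ones used later in this lab.

```python
import numpy as np
import pandas as pd

# Six made-up trips: three with 1 passenger, three with 2 or more.
toy_taxi = pd.DataFrame({
    "passenger_count": [1, 1, 1, 2, 3, 2],
    "trip_distance": [1.0, 2.0, 3.0, 2.0, 4.0, 6.0],
})

# Step 1: mean distance for solo trips, (1 + 2 + 3) / 3 = 2.0
solo_mean = toy_taxi[toy_taxi["passenger_count"] == 1]["trip_distance"].mean()

# Step 2: mean distance for trips with 2 or more passengers, (2 + 4 + 6) / 3 = 4.0
multi_mean = toy_taxi[toy_taxi["passenger_count"] >= 2]["trip_distance"].mean()

# Step 3: absolute difference of the two means, |4.0 - 2.0| = 2.0
toy_ts = np.abs(multi_mean - solo_mean)
print(toy_ts)
```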

Let's do step 1: compute the mean trip distance when there is only 1 passenger.

<details> <summary>Hint:</summary>
a. Use a filter to create a new dataframe containing only trips with 1 passenger.
b. Compute the mean trip distance in the new dataframe.
</details>

Step 2: Compute the mean trip distance when there are 2 or more passengers.

Step 3: Subtract mean 1 from mean 2 and take the absolute value.

This value is the test statistic for our data.

#### Hypothesis testing step 3
Step 3 is to simulate the test statistic assuming the null hypothesis is true.

We will do this by permuting (randomly shuffling) the passenger count values across the rows of the dataframe, without changing any other columns.  If the passenger count doesn't matter, then shuffling it shouldn't change the difference in means by more than chance variation.
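To see what permuting a column does, here is a minimal sketch on a made-up dataframe. The names `toy_taxi` and `shuffled` are invented for illustration, and the seed is only there to make the example repeatable.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # only so the example is repeatable

toy_taxi = pd.DataFrame({
    "passenger_count": [1, 1, 1, 2, 3, 2],
    "trip_distance": [1.0, 2.0, 3.0, 2.0, 4.0, 6.0],
})

# Work on a copy so the original dataframe is left untouched.
shuffled = toy_taxi.copy()

# np.random.permutation returns the same values in a random order.
shuffled["passenger_count"] = np.random.permutation(shuffled["passenger_count"])

print(toy_taxi)
print(shuffled)
```

The shuffled column still contains three 1s, two 2s and one 3, and the `trip_distance` column has not moved. Only the pairing between passenger counts and distances has been broken, which is what the null hypothesis says should not matter.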

First let's make a new dataframe called `permuted_taxi` by loading the data from the CSV file again.

The following code will permute the `passenger_count` column and then display the new dataframe:

In [None]:
permuted_taxi['passenger_count'] = np.random.permutation(permuted_taxi['passenger_count'])
permuted_taxi.head()

Compare the first few rows of `permuted_taxi` with the first few rows of `taxi`.  Some of the `passenger_count` values should have changed.

Compute the difference between mean trip distance with 1 passenger and the mean trip distance with 2 or more passengers using the `permuted_taxi` dataframe.   

<details> <summary>Answer:</summary>
solo_filter = permuted_taxi["passenger_count"] == 1
solo_taxi = permuted_taxi[solo_filter]
solo_mean = solo_taxi["trip_distance"].mean() 

multi_filter = permuted_taxi["passenger_count"] >= 2
multi_taxi = permuted_taxi[multi_filter]
multi_mean = multi_taxi["trip_distance"].mean() 

sample_ts = np.abs(solo_mean - multi_mean)
sample_ts
</details>

Now, let's repeat these steps (permuting the `passenger_count` column and computing the difference between the two means in the permuted dataframe) many times, storing the mean differences in a list.

Remember, use a small number of iterations to test your code, so it is faster.

<details> <summary>Answer:</summary>
differences = []
for i in range(10000):
    # Shuffle the passenger counts again on every iteration
    permuted_taxi['passenger_count'] = np.random.permutation(permuted_taxi['passenger_count'])

    solo_filter = permuted_taxi["passenger_count"] == 1
    solo_taxi = permuted_taxi[solo_filter]
    solo_mean = solo_taxi["trip_distance"].mean()

    multi_filter = permuted_taxi["passenger_count"] >= 2
    multi_taxi = permuted_taxi[multi_filter]
    multi_mean = multi_taxi["trip_distance"].mean()

    sample_ts = np.abs(solo_mean - multi_mean)
    differences.append(sample_ts)

Graph the histogram of the differences in means assuming the null hypothesis is true.

#### Hypothesis testing step 4
Compare the difference in means from the data with the histogram.  Does your data test statistic look like it comes from the histogram distribution?

Reject or fail to reject the null hypothesis.
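To make the comparison less subjective, you can mark the data's test statistic on the histogram and compute the fraction of permuted differences that are at least as large. The sketch below runs the whole procedure on made-up trips, not the green taxi data. The dataframe `toy`, the helper `mean_difference`, the group sizes and the 1000 repetitions are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(0)  # only so the example is repeatable

# Made-up trips: 60 with 1 passenger, 40 with 2, and random distances.
toy = pd.DataFrame({
    "passenger_count": [1] * 60 + [2] * 40,
    "trip_distance": np.random.exponential(3.0, size=100),
})

def mean_difference(frame):
    """Absolute difference in mean trip distance, solo vs. 2 or more."""
    solo = frame.loc[frame["passenger_count"] == 1, "trip_distance"].mean()
    multi = frame.loc[frame["passenger_count"] >= 2, "trip_distance"].mean()
    return np.abs(solo - multi)

# Test statistic for the (made-up) data.
observed = mean_difference(toy)

# Test statistics under the null hypothesis: shuffle, then recompute.
shuffled = toy.copy()
differences = []
for i in range(1000):
    shuffled["passenger_count"] = np.random.permutation(shuffled["passenger_count"])
    differences.append(mean_difference(shuffled))

# Fraction of permuted differences at least as large as the observed one.
p_value = np.mean(np.array(differences) >= observed)
print(observed, p_value)

# Histogram of the permuted differences with the observed value marked.
plt.hist(differences, bins=30)
plt.axvline(observed, color="red")
plt.xlabel("Absolute difference in mean trip distance")
plt.ylabel("Number of permutations")
```

If the red line falls far out in the right tail (a small fraction), reject the null hypothesis; otherwise, fail to reject it. A common cutoff is 0.05, but the choice of cutoff is a convention rather than a rule.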

### Challenges
- Create and test another hypothesis for the green taxi trip data.